In this vignette, we will use the omnideconv package to deconvolve a
bulk RNA-seq dataset from 24 breast cancer patients with two different
methods (DWLS and BayesPrism). The datasets are from a recent breast
cancer study (Wu et al.
2021). This study provides access to a primary-tumor single
cell RNA-seq (scRNA-seq) dataset from 26 breast cancer patients across
three major cancer subtypes (ER+, HER2+, TNBC). The dataset includes
cell-type annotation for three resolution levels. In addition, the data
includes bulk RNA-seq sequencing for 24 of the patients. In this
chapter, we use multi-sample, breast-cancer scRNA-seq atlas (100,064
cells) as a reference to train the methods for the deconvolution of the
bulk RNA-seq samples. The single-cell and bulk RNA-seq data is deposited
on GEO, under accession number: GSE176078.
The single cell data comes with the author’s cell type
annotations.
We will need to download and unzip the datasets
(GSE176078_Wu_etal_2021_BRCA_scRNASeq.tar.gz,
GSE176078_Wu_etal_2021_bulkRNAseq_raw_counts.txt.gz), and
store them in the working directory. For this example analysis, we will
also need to retrieve the additional clinical information about the
patients – although it is not required by omnideconv. This is available
in the Supplementary
Table 1, included with the paper supplementary materials
library(Seurat)
## Loading required package: SeuratObject
## Loading required package: sp
##
## Attaching package: 'SeuratObject'
## The following objects are masked from 'package:base':
##
## intersect, t
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.1 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.1
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(omnideconv)
## Loading required package: snowfall
## Loading required package: snow
## → checking omnideconv environment and dependencies
## + "C:/Users/c1041161/AppData/Local/miniconda3/condabin/conda.bat" install --yes --prefix "C:/Users/c1041161/AppData/Local/miniconda3/envs/r-omnideconv" -c conda-forge "anndata>=0.7.5"
library(readxl)
library(ggpubr)
## Loading required package: knitr
Although not strictly required by omnideconv, we suggest performing quality control and filtering of the input scRNA-seq data according to the best practice (Heumos et al. 2023), to ensure the best training conditions for deconvolution algorithms. We first pre-process the single-cell dataset to remove low-quality cells. We will use the R package Seurat (Hao et al. 2023), which allows fast and easy manipulation of single-cell data. We will create a Seurat object with the cell counts and their metadata of interest curated by the authors, which include cell-type annotation on three levels of resolution:
single.cell.data <- Seurat::ReadMtx(
mtx = './Wu_etal_2021_BRCA_scRNASeq/count_matrix_sparse.mtx',
cells = './Wu_etal_2021_BRCA_scRNASeq/count_matrix_barcodes.tsv',
features = './Wu_etal_2021_BRCA_scRNASeq/count_matrix_genes.tsv',
feature.column = 1
)
single.cell.metadata <- read.table('./Wu_etal_2021_BRCA_scRNASeq/metadata.csv',
sep = ',',
header = TRUE,
row.names = 1)
single.cell.data = CreateSeuratObject(single.cell.data,
project='Wu_dataset',
assay='RNA',
min.cells = 0,
min.features = 1, meta.data = single.cell.metadata)
We can have an overview of the number of cells per cell type in the dataset:
table(single.cell.data$celltype_major, dnn = list("celltype_major")) %>%
as.data.frame(., responseName = "number_cells")
## celltype_major number_cells
## 1 B-cells 3206
## 2 CAFs 6573
## 3 Cancer Epithelial 24489
## 4 Endothelial 7605
## 5 Myeloid 9675
## 6 Normal Epithelial 4355
## 7 Plasmablasts 3524
## 8 PVL 5423
## 9 T-cells 35214
In order to remove low quality cells, we will follow the best practices for single cell normalization (Heumos et al. 2023). We will perform quality control on each cell by considering metrics such as the number of total counts, the number of expressed features (genes), and the fraction of mitochondrial genes per cell. The metric considered is the MAD, (Mean Absolute Deviation), computed as \(MAD = median(|X_i - median(X)|)\) with \(X_i\) being the respective metric of an observation (cell). We will consider as outliers cells that have \(X_i < n*|MAD|\), with \(n=5\) for number of counts and number of features, and n=3 for the fraction of mitochondrial genes.
is_outlier <- function(SeuratObject, metric, nmads){
eval(parse(text = paste0("M <- SeuratObject$",metric)))
outlier <- (M < median(M) - nmads * mad(M)) | (M > median(M) + nmads * mad(M))
return(outlier)
}
check_outliers_nFeature <- is_outlier(single.cell.data, 'nFeature_RNA', 5)
check_outliers_nCount <- is_outlier(single.cell.data, 'nCount_RNA', 5)
check_outliers_mito <- is_outlier(single.cell.data, 'percent.mito', 3)
non_outliers_nFeature <- names(check_outliers_nFeature[!check_outliers_nFeature])
non_outliers_nCount <- names(check_outliers_nCount[!check_outliers_nCount])
non_outliers_mito <- names(check_outliers_mito[!check_outliers_mito])
We will retain only those that satisfy all three of the conditions described above.
non_outliers <- intersect(non_outliers_nFeature, non_outliers_nCount) %>%
intersect(non_outliers_mito)
single.cell.data <- subset(single.cell.data, cells = non_outliers)
as.data.frame(table(single.cell.data$celltype_major, dnn = list("celltype_major")), responseName = "number_cells")
## celltype_major number_cells
## 1 B-cells 3150
## 2 CAFs 6154
## 3 Cancer Epithelial 16613
## 4 Endothelial 6991
## 5 Myeloid 8855
## 6 Normal Epithelial 3533
## 7 Plasmablasts 3164
## 8 PVL 5120
## 9 T-cells 34991
We will now read in the bulk sequencing data file, which consists of 24 samples.
bulk.data <- read.table('C:/Users/c1041161/book_chapter/GSE176078_Wu_etal_2021_bulkRNAseq_raw_counts.txt', skip=1)
header <- read.table('C:/Users/c1041161/book_chapter/GSE176078_Wu_etal_2021_bulkRNAseq_raw_counts.txt',
header = FALSE, nrows = 1, skipNul = TRUE, sep='\t')
colnames(bulk.data) <- c('Genes', gsub('A|N', '', header[2:25]))
bulk.data <- bulk.data[bulk.data$Genes != '', ]
bulk.data <- column_to_rownames(bulk.data, 'Genes')
bulk.data <- as.matrix(bulk.data)
The various methods included in omnideconv rely on the single cell dataset that will be used to train them for the deconvolution of those specific cell types. This training involves the optimization of internal features of the methods and can happen in different ways. Some methods use the single cell data to build a ‘signature matrix’, i.e. a reduced transcriptional fingerprints of the cell types provided, while others use this data in a statistical or deep learning model. Since single cell datasets can often encompass thousands of cells, we will need to subsample it in order to be able to run the analysis on our machines. In this case we will retain a maximum of 200 cells per cell type, but this step can be costumed, or eventually skipped, depending on the computational resources available.
set.seed(42)
max_cells_per_celltype = 200
sampled.metadata <- single.cell.data@meta.data %>%
rownames_to_column(., 'barcode') %>%
group_by(., celltype_major) %>%
nest() %>%
mutate(n = map_dbl(data, nrow)) %>%
mutate(n = min(n, max_cells_per_celltype)) %>%
ungroup() %>%
mutate(samp = map2(data, n, sample_n)) %>%
select(-data) %>%
unnest(samp)
single.cell.data.sampled <- subset(single.cell.data, cells = sampled.metadata$barcode)
as.data.frame(table(single.cell.data.sampled$celltype_major, dnn = list("celltype_major")), responseName = "number_cells")
## celltype_major number_cells
## 1 B-cells 200
## 2 CAFs 200
## 3 Cancer Epithelial 200
## 4 Endothelial 200
## 5 Myeloid 200
## 6 Normal Epithelial 200
## 7 Plasmablasts 200
## 8 PVL 200
## 9 T-cells 200
Each methods has different requirements, but in general to compute the deconvolution results we will need the single cell counts matrix, the cell type annotations and the information on the individual/experiment fom which the cells were retrieved (batch ID).
counts.matrix <- as.matrix(single.cell.data.sampled@assays$RNA$counts)
cell.type.annotations <- single.cell.data.sampled$celltype_major
batch.ids <- single.cell.data.sampled$orig.ident
Now we’re going to deconvolute the bulk dataset with different methods. The first one we are going to use is called DWLS (Tsoucas et al. 2019) and performs the deconvolution in a two-steps process. First, the single cell data is used to build a signature matrix using the omnideconv function build_model. DWLS looks for differentially expressed genes that discriminate across cell types, and can do so with two approaches based either on the Seurat (Hao et al. 2023) “bimod” test (McDavid et al. 2012) or on MAST (Finak et al. 2015). MAST improves the former model, but has an increased computational requirement (Nault et al. 2022). The authors recommend using this method on the smaller datasets, and to switch to Seurat if the analysis with MAST cannot be completed. To reduce MAST’s computational time, we introduced a second version of the MAST-based function (mast_optimized) that speeds up the process compared to the original implementation:
# We need to insert the normalization as well
signature.matrix.dwls <- omnideconv::build_model(single_cell_object = counts.matrix,
cell_type_annotations = cell.type.annotations,
method = 'dwls',
dwls_method = 'mast_optimized')
This signature is optimized so that the genes selected are the ones that help to discriminate across cell types.
head(signature.matrix.dwls)
## Gene.ID Endothelial CAFs PVL B.cells T.cells Myeloid Normal.Epithelial
## 1 ACKR1 15.230 0.075 0.040 0.025 0.005 0.020 0.095
## 2 PLVAP 10.680 0.110 0.040 0.015 0.005 0.030 0.105
## 3 VWF 6.065 0.065 0.020 0.000 0.020 0.000 0.010
## 4 RAMP2 8.665 0.210 0.025 0.005 0.005 0.005 0.155
## 5 CLDN5 5.190 0.115 0.025 0.000 0.000 0.010 0.020
## 6 AQP1 6.450 0.160 0.125 0.015 0.000 0.015 0.095
## Plasmablasts Cancer.Epithelial
## 1 0.010 0.000
## 2 0.005 0.030
## 3 0.005 0.015
## 4 0.020 0.195
## 5 0.000 0.010
## 6 0.000 0.030
The signature is now used for the deconvolution of the bulk RNAseq,
which is performed with the omnideconv function
deconvolute. DWLS computes the cell fractions performing
one of Ordinary Least Squares (OLS) Regression, Support Vector
Regression (SVR) or the Dampened Weighted Least Squares Regression
(DWLS) that was introduced with the method. This last regression method
is shown to outperform the others when it comes to the detection of rare
cell types:
deconvolution.results.dwls <- deconvolute(bulk_gene_expression = bulk.data,
signature = signature.matrix.dwls,
method='dwls',
dwls_submethod = 'DampenedWLS')
We will now obtain, for every sample, a set of cell type fractions for each cell type that was included in the provided single cell dataset.
head(deconvolution.results.dwls)
## B-cells CAFs Cancer Epithelial Endothelial Myeloid
## CID3586 1.582901e-01 0.2518742 0.09114178 0.06271542 0.042637219
## CID3921 1.681912e-01 0.2021584 0.08099052 0.09070612 0.017635665
## CID3941 3.711408e-02 0.2214264 0.27247883 0.07200429 0.040131349
## CID3948 1.236697e-01 0.1220122 0.39791826 0.02523021 0.002531945
## CID3963 2.734761e-02 0.2066542 0.01207755 0.05969586 0.131974995
## CID4066 2.742824e-11 0.4308795 0.15812117 0.09552470 0.002181170
## Normal Epithelial Plasmablasts PVL T-cells
## CID3586 1.547317e-01 -2.344791e-33 0.03221636 0.20639327
## CID3921 1.011845e-01 2.417927e-04 0.04400799 0.29488375
## CID3941 2.792117e-01 2.767974e-22 0.06969082 0.00794252
## CID3948 2.199138e-16 1.743133e-03 0.05252235 0.27437215
## CID3963 3.416887e-01 -1.008382e-21 0.03098350 0.18957754
## CID4066 2.084751e-01 -7.112521e-21 0.03140592 0.07341250
We can also visualise the results as a barplot trough the built-in
plot_deconvolution function
omnideconv::plot_deconvolution(list('dwls' = deconvolution.results.dwls), "bar", "method", "Spectral")
The third method we will use is BAyesPrism (Chu et al. 2022). This method is based on a Bayesian framework and models the transcriptomic expression observed in the scRNA-seq dataset. It then uses this information to dissect te bulk RNA-seq.
# BayesPrism deconvolution
deconvolution.results.bayesprism <- deconvolute(bulk_gene_expression = bulk.data,
single_cell_object = counts.matrix,
cell_type_annotations = cell.type.annotations,
method = 'bayesprism',
n_cores=12)
We can visualize the results as before:
omnideconv::plot_deconvolution(list('bayesprism' = deconvolution.results.bayesprism), "bar", "method", "Spectral")
The considered single-cell breast cancer dataset includes cell-type
annotations at three levels of resolution: celltype_major,
celltype_minor, and celltype_subset, which
distinguish 9, 29, and 58 cell types respectively. The different
cell-type annotations can be accessed with:
single.cell.data$celltype_major # Major annotation
single.cell.data$celltype_minor # Minor annotation
single.cell.data$celltype_subset # Subset annotation
These additional annotations provide a cell-type classification at a
finer resolution. For instance, at the celltype_major
level, we only have the T cell population, while at the
celltype_minor level, we can distinguish between CD4+ and
CD8+ T cells. In the following, we will again perform deconvolution
analysis with DWLS but, this time, using the celltype_minor
information. We will subsample the dataset as before, this time
considering the second level of resolution for the cell types, and
extract the objects needed for deconvolution.
set.seed(42)
max_cells_per_celltype = 200
sampled.metadata <- single.cell.data@meta.data %>%
rownames_to_column(., 'barcode') %>%
group_by(., celltype_minor) %>%
nest() %>%
mutate(n = map_dbl(data, nrow)) %>%
mutate(n = min(n, max_cells_per_celltype)) %>%
ungroup() %>%
mutate(samp = map2(data, n, sample_n)) %>%
select(-data) %>%
unnest(samp)
single.cell.data.sampled <- subset(single.cell.data, cells = sampled.metadata$barcode)
as.data.frame(table(single.cell.data.sampled$celltype_minor, dnn = list("celltype_minor")), responseName = "number_cells")
## celltype_minor number_cells
## 1 B cells Memory 200
## 2 B cells Naive 200
## 3 CAFs MSC iCAF-like 200
## 4 CAFs myCAF-like 200
## 5 Cancer Basal SC 200
## 6 Cancer Cycling 200
## 7 Cancer Her2 SC 200
## 8 Cancer LumA SC 200
## 9 Cancer LumB SC 200
## 10 Cycling PVL 37
## 11 Cycling T-cells 200
## 12 Cycling_Myeloid 200
## 13 DCs 200
## 14 Endothelial ACKR1 200
## 15 Endothelial CXCL12 200
## 16 Endothelial Lymphatic LYVE1 183
## 17 Endothelial RGS5 200
## 18 Luminal Progenitors 200
## 19 Macrophage 200
## 20 Mature Luminal 200
## 21 Monocyte 200
## 22 Myoepithelial 200
## 23 NK cells 200
## 24 NKT cells 200
## 25 Plasmablasts 200
## 26 PVL Differentiated 200
## 27 PVL Immature 200
## 28 T cells CD4+ 200
## 29 T cells CD8+ 200
counts.matrix <- as.matrix(single.cell.data.sampled@assays$RNA$counts)
cell.type.annotations <- single.cell.data.sampled$celltype_minor
batch.ids <- single.cell.data.sampled$orig.ident
signature.matrix.dwls.minor <- omnideconv::build_model(single_cell_object = counts.matrix,
cell_type_annotations = cell.type.annotations,
method = 'dwls',
dwls_method = 'mast_optimized')
deconvolution.results.dwls.minor <- deconvolute(bulk_gene_expression = bulk.data,
model = signature.matrix.dwls.minor,
method='dwls',
dwls_submethod = 'DampenedWLS')
We can visualize the results as before:
omnideconv::plot_deconvolution(list('dwls' = deconvolution.results.dwls.minor), "bar", "method", "Spectral")
dwls.results.long <- deconvolution.results.dwls %>%
as.data.frame() %>%
rownames_to_column(., 'Sample') %>%
gather(., key = 'cell_type', value = 'dwls_fraction', -'Sample')
bayesprism.results.long <- deconvolution.results.bayesprism %>%
as.data.frame() %>%
rownames_to_column(., 'Sample') %>%
gather(., key = 'cell_type', value = 'bayesprism_fraction', -'Sample')
results.long <- full_join(dwls.results.long, bayesprism.results.long)
## Joining with `by = join_by(Sample, cell_type)`
ggplot(results.long, aes(x = dwls_fraction, y = bayesprism_fraction)) +
geom_abline(linetype = 'dashed') +
geom_point(aes(color = cell_type)) +
theme_bw()+
theme(legend.position = 'hide') +
facet_wrap(.~cell_type) +
ylab('BayesPrism fractions')+
xlab('DWLS fractions')
We can consider as well the metadata provided with the original paper, which include patient’s data, treatment details, and more. We can first harmonize the sample names, and then combine all this information with the deconvolution results in one dataframe.
patient.metadata <- read_excel("C:/Users/c1041161/book_chapter/41588_2021_911_MOESM4_ESM.xlsx", sheet = 1, skip = 3) %>%
select(., c(1, 4, 5, 12))
colnames(patient.metadata) <- c('Sample', 'Grade', 'Cancer_type', 'Treatment')
patient.metadata$Sample <- gsub('-', '', patient.metadata$Sample) %>%
paste0('CID', .)
patients.results <- rownames_to_column(as.data.frame(deconvolution.results.dwls), 'Sample') %>%
gather(., key='celltype', value='cell_fraction', -Sample) %>%
left_join(., patient.metadata)
## Joining with `by = join_by(Sample)`
as.data.frame(table(patient.metadata$Treatment, dnn = list("Treatment")), responseName = "number_patients")
## Treatment number_patients
## 1 Naïve 21
## 2 Treated 5
In this cohort, 5 out of 24 patients were treated with neoadjuvant and/or Paclitaxel therapy, while the remaining 19 were untreated. We can visualize the fractions deconvolved by DWLS for every cell type, stratifying the patients into treated and untreated ones
ggplot(patients.results,
aes(x = Treatment, y = cell_fraction, fill = Treatment)) +
geom_boxplot() +
facet_wrap(.~celltype, scales = 'free_y', ncol = 4) +
ylab('cell fractions') +
xlab('') +
theme_bw() +
theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
stat_compare_means(label='p.signif', method = "wilcox.test",
comparisons = list(c("Naïve", "Treated")),
hide.ns = TRUE) +
scale_y_continuous(expand = expansion(mult = c(0.05, 0.25)))
By considering deconvolution results based on the fine-grained annotations, we can quantify additional cell types including different molecular subtypes of tumor cells: basal, Her2, and luminal A and B. The classification of malignant epithelial cells was performed at the single-cell level in the original publication using a new tool, SCSubtype (Wu et al. 2021). In parallel, the authors stratified the bulk RNA-seq samples into tumor molecular subtypes (luminal-like A/B, HER2-enriched, basal-like and normal-like) using the PAM50 gene signature (Pu et al. 2020):
patient.pam50.typing <- read_excel("C:/Users/c1041161/book_chapter/41588_2021_911_MOESM4_ESM.xlsx",
sheet = 3, skip = 3) %>%
select(., c(1, 4))
colnames(patient.pam50.typing) <- c('Sample', 'PAM50_subtype')
as.data.frame(table(patient.pam50.typing$PAM50_subtype, dnn = list("PAM50_subtype")),
responseName = "number_patients")
## PAM50_subtype number_patients
## 1 Basal 6
## 2 Her2 1
## 3 LumA 5
## 4 LumB 4
## 5 Normal 2
## 6 Not available 2
patients.results.minor <- rownames_to_column(as.data.frame(deconvolution.results.dwls.minor), 'Sample') %>%
gather(., key='celltype', value='cell_fraction', -Sample) %>%
left_join(., patient.pam50.typing)
## Joining with `by = join_by(Sample)`
patients.results.minor$PAM50_subtype[is.na(patients.results.minor$PAM50_subtype)] <- 'Unkn.'
We can then visualize the distribution of the fractions estimated by DWLS for the different cancer-cell molecular subtypes, stratifying the samples according to the PAM50 classifications:
patients.results.minor %>%
filter(celltype %in% c('Cancer Basal SC', 'Cancer LumA SC',
'Cancer LumB SC', 'Cancer Her2 SC')) %>%
ggplot(., aes(x = PAM50_subtype, y = cell_fraction, fill = PAM50_subtype)) +
geom_boxplot() +
facet_wrap(.~celltype, scales = 'free_y', ncol = 2) +
ylab('cell fractions') +
xlab('') +
theme_bw() +
theme(axis.text.x = element_text(angle = 45, hjust = 1),
legend.position = 'None') +
scale_y_continuous(expand = expansion(mult = c(0.05, 0.25)))
We can see that DWLS identifies the correct cancer-cell molecular subtype as the most abundant in corresponding samples, although with some discrepancies, especially for luminal A cells found in normal samples and luminal B cells found in luminal A samples.